Introducing LiteRT Next: A new set of APIs that improves and simplifies on-device hardware acceleration.

GPU acceleration with LiteRT Next

Graphics Processing Units (GPUs) are commonly used for deep learning acceleration due to their massive parallel throughput compared to CPUs. LiteRT Next simplifies the process of using GPU acceleration by allowing users to specify the hardware acceleration as a parameter when creating a Compiled Model (CompiledModel). LiteRT Next also uses a new and improved GPU acceleration implementation, not offered by LiteRT.

With LiteRT Next's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism.

For example implementations of LiteRT Next with GPU support, refer to the following demo applications:

Add GPU dependency

Use the following steps to add the GPU dependency to your Kotlin or C++ application.

Kotlin

For Kotlin users, the GPU accelerator is built-in and does not require additional steps beyond the Get Started guide.

C++

For C++ users, you must build the dependencies of the application with LiteRT GPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:

LiteRT C API shared library: the data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and GPU-specific components (@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so).
Attribute dependencies: The deps attribute typically includes the GLES dependencies gles_deps(), and the linkopts attribute typically includes gles_linkopts(). Both matter for GPU acceleration because LiteRT often uses OpenGL ES on Android.
Model files and other assets: Included through the data attribute.

The following is an example of a cc_binary rule:

cc_binary(
    name = "your_application",
    srcs = [
        "main.cc",
    ],
    data = [
        ...
        # litert c api shared library
        "//litert/c:litert_runtime_c_api_shared_lib",
        # GPU accelerator shared library
        "@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so",
    ],
    linkopts = select({
        "@org_tensorflow//tensorflow:android": ["-landroid"],
        "//conditions:default": [],
    }) + gles_linkopts(), # gles link options
    deps = [
        ...
        "//litert/cc:litert_tensor_buffer", # litert cc library
        ...
    ] + gles_deps(), # gles dependencies
)

This setup allows your compiled binary to dynamically load and use the GPU for accelerated machine learning inference.

Get started

To get started using the GPU accelerator, pass the GPU parameter when creating the Compiled Model (CompiledModel). The following code snippet shows a basic implementation of the entire process:

C++

// 1. Load model
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));

// 2. Create a compiled model targeting GPU
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model, CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// 3. Prepare input/output buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// 4. Fill input data (if you have CPU-based data)
input_buffers[0].Write<float>(absl::MakeConstSpan(cpu_data, data_size));

// 5. Execute
compiled_model.Run(input_buffers, output_buffers);

// 6. Access model output
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));

Kotlin

// Load model and initialize runtime
val model =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.GPU),
        env,
    )

// Preallocate input/output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// Fill the first input
inputBuffers[0].writeFloat(FloatArray(data_size) { data_value /* your data */ })

// Invoke
model.run(inputBuffers, outputBuffers)

// Read the output
val outputFloatArray = outputBuffers[0].readFloat()

For more information, see the Get Started with C++ or Get Started with Kotlin guides.

LiteRT Next GPU Accelerator

The new GPU Accelerator, available only with LiteRT Next, is optimized to handle AI workloads, like large matrix multiplications and KV cache for LLMs, more efficiently than previous versions. The LiteRT Next GPU Accelerator features the following key improvements over the LiteRT version:

Extended Operator Coverage: Handle larger, more complex neural networks.
Better Buffer Interoperability: Enable direct usage of GPU buffers for camera frames, 2D textures, or large LLM states.
Async Execution support: Overlap CPU pre-processing with GPU inference.

Zero-copy with GPU acceleration

Using zero-copy enables a GPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.
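To make the distinction concrete, here is a minimal CPU-only analogy in standard C++. It does not use the LiteRT Next API. One function stages data by copying it, and the other wraps the caller's existing memory in a non-owning view. FloatView, StageByCopy, and WrapWithoutCopy are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over floats that already live somewhere else.
// It stands in for a tensor buffer that wraps existing memory.
struct FloatView {
  float* data;
  std::size_t size;
};

// Copying path: allocates new storage and duplicates every element.
std::vector<float> StageByCopy(const std::vector<float>& source) {
  return std::vector<float>(source.begin(), source.end());
}

// Zero-copy path: records only the pointer and length; no element is moved.
FloatView WrapWithoutCopy(std::vector<float>& source) {
  assert(!source.empty());
  return FloatView{source.data(), source.size()};
}
```

A later write to the source is visible through the view but not through the copy, which is the property that makes wrapping cheaper and also means the source must outlive the view.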

The following code is an example implementation of zero-copy GPU acceleration with OpenGL, an API for rendering 2D and 3D graphics. The code passes images in the OpenGL buffer format directly to LiteRT Next:

// Suppose you have an OpenGL buffer consisting of:
// target (GLenum), id (GLuint), size_bytes (size_t), and offset (size_t)
// Load model and compile for GPU
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// Create a TensorBuffer that wraps the OpenGL buffer.
LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor_name"));
LITERT_ASSIGN_OR_RETURN(auto gl_input_buffer, TensorBuffer::CreateFromGlBuffer(env,
    tensor_type, opengl_buffer.target, opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));
std::vector<TensorBuffer> input_buffers{gl_input_buffer};
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Execute
compiled_model.Run(input_buffers, output_buffers);

// If your output is also GPU-backed, you can fetch an OpenCL buffer or re-wrap it as an OpenGL buffer:
LITERT_ASSIGN_OR_RETURN(auto out_cl_buffer, output_buffers[0].GetOpenClBuffer());

Asynchronous execution

LiteRT's asynchronous methods, such as RunAsync(), let you schedule GPU inference while the CPU or the NPU continues with other tasks. In complex pipelines, the GPU often runs asynchronously alongside the CPU or NPUs.

The following code snippet builds on the code provided in the zero-copy example above. It uses the CPU and the GPU asynchronously and attaches a LiteRT Event to the input buffer. A LiteRT Event manages different types of synchronization primitives, and the code creates a managed LiteRT Event object of type LiteRtEventTypeEglSyncFence. This Event object ensures that the input buffer is not read until the GPU is done with it. The synchronization happens without involving the CPU.

LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// 1. Prepare input buffer (OpenGL buffer)
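// Wraps the same OpenGL buffer (target, id, size_bytes, offset) as the zero-copy example.
LITERT_ASSIGN_OR_RETURN(auto gl_input,
    TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_buffer.target,
        opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));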
std::vector<TensorBuffer> inputs{gl_input};
LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());

// 2. If the GL buffer is in use, create and set an event object to synchronize with the GPU.
LITERT_ASSIGN_OR_RETURN(auto input_event,
    Event::CreateManagedEvent(env, LiteRtEventTypeEglSyncFence));
inputs[0].SetEvent(std::move(input_event));

// 3. Kick off the GPU inference
compiled_model.RunAsync(inputs, outputs);

// 4. Meanwhile, do other CPU work...
// The CPU stays busy here...

// 5. Access model output
std::vector<float> data(output_data_size);
outputs[0].Read<float>(absl::MakeSpan(data));
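
The overlap pattern in steps 3 to 5 can also be illustrated with the standard library alone. The sketch below uses std::async as a stand-in for a GPU job; it is not LiteRT's RunAsync() and involves no GPU. SimulatedInference, PreprocessNextFrame, PipelineResult, and RunOverlapped are hypothetical names.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for GPU inference: doubles every input value.
std::vector<float> SimulatedInference(std::vector<float> input) {
  for (float& v : input) v *= 2.0f;
  return input;
}

// Hypothetical CPU-side work that runs while the inference is in flight.
float PreprocessNextFrame(const std::vector<float>& next_frame) {
  return std::accumulate(next_frame.begin(), next_frame.end(), 0.0f);
}

struct PipelineResult {
  std::vector<float> output;
  float next_frame_sum;
};

PipelineResult RunOverlapped(const std::vector<float>& frame,
                             const std::vector<float>& next_frame) {
  assert(!frame.empty());
  // 1. Kick off the simulated inference on another thread; this call
  //    returns without waiting for the result.
  std::future<std::vector<float>> pending =
      std::async(std::launch::async, SimulatedInference, frame);
  // 2. Meanwhile, do other CPU work.
  float sum = PreprocessNextFrame(next_frame);
  // 3. Block only at the point where the output is actually needed.
  return PipelineResult{pending.get(), sum};
}
```

The design point is that the only blocking call, pending.get(), is placed after the independent CPU work, so the two run concurrently.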

Supported models

LiteRT Next supports GPU acceleration with the following models. Benchmark results are based on tests run on a Samsung Galaxy S24 device.

Model	LiteRT GPU Acceleration	LiteRT GPU latency (ms)
hf_mms_300m	Fully delegated	19.6
hf_mobilevit_small	Fully delegated	8.7
hf_mobilevit_small_e2e	Fully delegated	8.0
hf_wav2vec2_base_960h	Fully delegated	9.1
hf_wav2vec2_base_960h_dynamic	Fully delegated	9.8
isnet	Fully delegated	43.1
timm_efficientnet	Fully delegated	3.7
timm_nfnet	Fully delegated	9.7
timm_regnety_120	Fully delegated	12.1
torchaudio_deepspeech	Fully delegated	4.6
torchaudio_wav2letter	Fully delegated	4.8
torchvision_alexnet	Fully delegated	3.3
torchvision_deeplabv3_mobilenet_v3_large	Fully delegated	5.7
torchvision_deeplabv3_resnet101	Fully delegated	35.1
torchvision_deeplabv3_resnet50	Fully delegated	24.5
torchvision_densenet121	Fully delegated	13.9
torchvision_efficientnet_b0	Fully delegated	3.6
torchvision_efficientnet_b1	Fully delegated	4.7
torchvision_efficientnet_b2	Fully delegated	5.0
torchvision_efficientnet_b3	Fully delegated	6.1
torchvision_efficientnet_b4	Fully delegated	7.6
torchvision_efficientnet_b5	Fully delegated	8.6
torchvision_efficientnet_b6	Fully delegated	11.2
torchvision_efficientnet_b7	Fully delegated	14.7
torchvision_fcn_resnet50	Fully delegated	19.9
torchvision_googlenet	Fully delegated	3.9
torchvision_inception_v3	Fully delegated	8.6
torchvision_lraspp_mobilenet_v3_large	Fully delegated	3.3
torchvision_mnasnet0_5	Fully delegated	2.4
torchvision_mobilenet_v2	Fully delegated	2.8
torchvision_mobilenet_v3_large	Fully delegated	2.8
torchvision_mobilenet_v3_small	Fully delegated	2.3
torchvision_resnet152	Fully delegated	15.0
torchvision_resnet18	Fully delegated	4.3
torchvision_resnet50	Fully delegated	6.9
torchvision_squeezenet1_0	Fully delegated	2.9
torchvision_squeezenet1_1	Fully delegated	2.5
torchvision_vgg16	Fully delegated	13.4
torchvision_wide_resnet101_2	Fully delegated	25.0
torchvision_wide_resnet50_2	Fully delegated	13.4
u2net_full	Fully delegated	98.3
u2net_lite	Fully delegated	51.4
hf_distil_whisper_small_no_cache	Partially delegated	251.9
hf_distilbert	Partially delegated	13.7
hf_tinyroberta_squad2	Partially delegated	17.1
hf_tinyroberta_squad2_dynamic_batch	Partially delegated	52.1
snapml_StyleTransferNet	Partially delegated	40.9
timm_efficientformer_l1	Partially delegated	17.6
timm_efficientformerv2_s0	Partially delegated	16.1
timm_pvt_v2_b1	Partially delegated	73.5
timm_pvt_v2_b3	Partially delegated	246.7
timm_resnest14d	Partially delegated	88.9
torchaudio_conformer	Partially delegated	21.5
torchvision_convnext_tiny	Partially delegated	8.2
torchvision_maxvit_t	Partially delegated	194.0
torchvision_shufflenet_v2	Partially delegated	9.5
torchvision_swin_tiny	Partially delegated	164.4
torchvision_video_resnet2plus1d_18	Partially delegated	6832.0
torchvision_video_swin3d_tiny	Partially delegated	2617.8
yolox_tiny	Partially delegated	11.2
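
Assuming each figure in the table is the time for a single inference in milliseconds, it can be converted to a rough upper bound on back-to-back throughput. The helper below is a hypothetical convenience function, not part of LiteRT Next, and it ignores pre-processing, post-processing, and any pipelining.

```cpp
#include <cassert>

// Converts a per-inference time in milliseconds to inferences per second,
// assuming inferences run back to back with no other overhead.
double InferencesPerSecond(double latency_ms) {
  assert(latency_ms > 0.0);
  return 1000.0 / latency_ms;
}
```

Under that assumption, torchvision_mobilenet_v2 at 2.8 ms corresponds to roughly 357 inferences per second, while torchvision_video_resnet2plus1d_18 at 6832.0 ms corresponds to about 0.15.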